[feature](iceberg) Support reading Iceberg variant from Parquet #63192
eldenmoon wants to merge 1 commit into
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
run buildall
/review
Force-pushed from cab85b8 to e9e3bfd
run buildall
/review
OpenCode automated review failed and did not complete. Error: the review step was skipped (possibly a timeout or cancellation). Please inspect the workflow logs and rerun the review after the underlying issue is resolved.
/review
Summary:
I found one blocking issue in the added regression test: the local single-BE copy path does not match the file_path used by the local TVF, so the test can fail in the exact environment that branch is intended to support.
Critical checkpoint conclusions:
- Goal: add Iceberg/Parquet VARIANT read support, including shredded projection. The implementation and regression coverage mostly target that goal, but the new regression test has a path setup bug.
- Scope: the production changes are focused on Parquet schema parsing, variant reconstruction, column pruning, and Iceberg type mapping.
- Concurrency/lifecycle: no new shared mutable concurrent state or non-obvious lifecycle ownership issue found in the reviewed PR diff.
- Configuration/compatibility: no new config items or persisted storage-format changes found; FE/BE type mapping paths for Iceberg VARIANT are updated.
- Parallel paths: Hive, Iceberg, and local Parquet pruning paths were considered; the test issue is distinct from production pruning logic.
- Tests: regression coverage was added, but the local-file staging logic can make the new test fail before validating the feature.
- Observability/performance: added ParquetReadColumnPaths profile string is useful for validating pruning; no blocking observability or hot-path issue found beyond the test blocker.
User focus: no additional user-provided review focus was present.
I found a correctness blocker in the new shredded VARIANT pruning logic. The implementation prunes the unshredded value leaf whenever a matching typed_value path exists, but Iceberg shredded VARIANT can still carry residual/unrepresentable values for that field in value, so queries can silently return NULL or partial objects for those rows.
Critical checkpoint conclusions:
- Goal/test: the PR adds Iceberg v3 VARIANT reading and pruning tests, but the tests only cover fully typed shredded fields and do not prove residual fallback correctness.
- Scope/focus: the change is mostly focused on Parquet/Iceberg VARIANT support.
- Concurrency/lifecycle/config: no new concurrency, non-trivial lifecycle, or config behavior found in the reviewed PR diff.
- Compatibility: adds new type mapping; no storage-format persistence changes found.
- Parallel paths: the same pruning issue exists in both standalone/Hive Parquet and Iceberg Parquet helper paths.
- Tests: missing mixed shredded/residual cases where a selected typed path is absent or has an incompatible type in typed_value but exists in value.
- Observability/performance: profile string helps inspect selected leaves; no additional blocking observability issue found.
- Data correctness: blocking issue below can cause incorrect query results after column pruning.
No additional user-provided focus points were present.
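The residual-fallback rule this review relies on can be sketched as follows. This is a hypothetical Java model of the Parquet Variant Shredding read semantics, not Doris code; FieldRow, readField, and readFieldTypedOnly are invented names for illustration.

```java
// Sketch (assumed semantics, not Doris code): a shredded field group carries
// an optional typed_value and an optional residual value; a correct reader
// must consult BOTH leaves, so pruning the `value` leaf whenever a matching
// `typed_value` exists silently drops rows that only carry a residual.
import java.util.Optional;

public class ShreddedFieldModel {
    /** One row of a shredded field: at most one of the two is present. */
    record FieldRow(Optional<Long> typedValue, Optional<String> residualValue) {}

    /** Correct read: typed_value first, then the residual value leaf. */
    static String readField(FieldRow row) {
        if (row.typedValue().isPresent()) {
            return row.typedValue().get().toString();
        }
        // Residual fallback: the leaf the over-eager pruning discards.
        return row.residualValue().orElse(null);
    }

    /** Over-pruned read: the residual leaf was never scanned. */
    static String readFieldTypedOnly(FieldRow row) {
        return row.typedValue().map(Object::toString).orElse(null);
    }

    public static void main(String[] args) {
        FieldRow typed = new FieldRow(Optional.of(42L), Optional.empty());
        FieldRow residual = new FieldRow(Optional.empty(), Optional.of("\"str\""));
        check("42".equals(readField(typed)));
        check("\"str\"".equals(readField(residual)));
        // The over-pruned reader silently returns NULL for the residual row.
        check(readFieldTypedOnly(residual) == null);
    }

    static void check(boolean ok) {
        if (!ok) throw new AssertionError();
    }
}
```

The sketch shows why the missing test matters: a typed-only reader returns the right answer for fully shredded rows and only diverges on rows whose value lives in the residual leaf.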
Force-pushed from e9e3bfd to 5fe9ca5
run buildall
/review
1 similar comment
/review
Reviewed the full PR with the Doris code-review checklist. I found a blocking correctness issue in nested VARIANT pruning for deeper accesses under shredded fields.
- Goal/test: the PR adds Iceberg/Parquet VARIANT read support and includes a regression for top-level shredded field reads, but it does not cover deeper access through a shredded field residual.
- Scope: the change is focused, but the duplicated Hive/Iceberg pruning helper needs the same fix.
- Concurrency/lifecycle/config/transaction/persistence: no new concurrency, lifecycle, config, transaction, storage-format write, or persistence concerns found.
- Parallel paths: the same issue exists in both Hive/local Parquet and Iceberg Parquet helpers.
- Tests: existing tests cover unshredded reads and top-level shredded field pruning, but should add a mixed residual case such as v['metric']['x'] where metric is shredded and some rows store an object in typed_value.metric.value.
- Observability: the new profile string is useful for validating selected leaves.
- User focus: no additional user-provided review focus was supplied.
Review result: request changes.
Critical checkpoint conclusions:
- Goal/test coverage: the PR adds Iceberg VARIANT Parquet reading, type mapping, pruning observability, and regression/unit coverage. The main scenario is covered, but a case-sensitive key path is not covered and currently regresses correctness.
- Scope/focus: the change is focused on Iceberg/Parquet VARIANT support, though duplicated Hive/Iceberg pruning helpers carry the same issue.
- Concurrency/lifecycle: no new shared mutable state, threads, locks, or static initialization hazards found in the reviewed paths.
- Configuration/compatibility: no new configs or storage-format writes; this is a reader/type-mapping change. Mixed files with non-VARIANT types continue through existing paths.
- Parallel paths: the Hive/local and Iceberg Parquet pruning paths were both reviewed; both have the same case-sensitivity bug and are commented separately.
- Error handling/memory: Status returns in the new reader path are generally propagated; no ignored Status or untracked large persistent allocation issue found beyond the correctness issue raised.
- Data correctness: blocking issue found: shredded VARIANT field lookup lowercases user path components, so distinct keys such as a and A can be pruned/read as the same field.
- Observability/performance: the profile leaf-path observable is useful; no additional blocker found.
User focus: no additional user-provided review focus was specified.
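The case-sensitivity bug can be illustrated with a minimal Java model. This is not the Doris helper; the map and method names are hypothetical, but the comparison shows why lowercasing path components collapses distinct Variant object keys.

```java
// Sketch (assumption: Variant object keys are case-sensitive byte strings):
// a lowercased lookup maps both 'a' and 'A' onto the same shredded leaf,
// while an exact lookup keeps them distinct.
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class CaseSensitiveLookup {
    static final Map<String, Integer> SHREDDED = new LinkedHashMap<>();
    static {
        SHREDDED.put("a", 1);
        SHREDDED.put("A", 2); // distinct Variant object keys
    }

    /** Buggy pattern: lowercasing collapses distinct keys onto one leaf. */
    static Integer findLowercased(String key) {
        for (Map.Entry<String, Integer> e : SHREDDED.entrySet()) {
            if (e.getKey().toLowerCase(Locale.ROOT)
                    .equals(key.toLowerCase(Locale.ROOT))) {
                return e.getValue();
            }
        }
        return null;
    }

    /** Correct pattern: compare keys exactly. */
    static Integer findExact(String key) {
        return SHREDDED.get(key);
    }

    public static void main(String[] args) {
        if (findLowercased("A") != 1) throw new AssertionError(); // wrong leaf
        if (findExact("A") != 2) throw new AssertionError();
        if (findExact("a") != 1) throw new AssertionError();
    }
}
```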
Force-pushed from 5fe9ca5 to fa098c0
run buildall
/review
TPC-H: Total hot run time: 29804 ms
TPC-H: Total hot run time: 29785 ms
FE UT Coverage Report: Increment line coverage
I found two additional correctness issues in the Parquet VARIANT reconstruction path.
Critical checkpoint conclusions:
- Goal/test: The PR adds Iceberg v3 Parquet VARIANT reading and shredded column pruning, with regression and FE unit coverage for the common unshredded/shredded object paths. However, valid shredded array layouts and non-finite shredded floats are not covered and can decode incorrectly or fail.
- Scope/focus: The implementation is mostly focused on the new reader/pruning behavior, though the duplicated Hive/Iceberg pruning helper logic remains a maintainability risk rather than a blocker.
- Concurrency/lifecycle: The reviewed changes are per-reader/per-query state and do not introduce new shared mutable state, locks, background threads, or special static lifecycle dependencies.
- Config/compatibility: No new config items or storage-format writes are introduced. The change reads a standard Parquet/Iceberg format; mixed-version compatibility concerns are limited to reader capability.
- Parallel paths: Hive/local and Iceberg pruning paths are both updated. Standalone Parquet uses the Hive-style name pruning helper.
- Conditional checks: The main conditional logic around shredded value/typed_value follows the Parquet Variant Shredding spec for object fields, but the array element check currently misses a spec-valid layout.
- Tests/results: Existing tests cover top-level object shredding, deeper residual paths, case-sensitive keys, and profile observability. Missing coverage remains for arrays whose element group omits value, and for NaN/Inf in shredded float/double typed values.
- Observability: The added ParquetReadColumnPaths profile string is useful for pruning verification and appears lightweight.
- Transactions/persistence/data writes: Not applicable; this is read-path only.
- FE/BE variable passing: Iceberg type mapping and access-path rewriting are updated for VARIANT; no additional thrift variable propagation issue found.
- Performance: The JSON reconstruction path is inherently allocation-heavy but limited to VARIANT decoding. No additional performance blocker found beyond the correctness issues below.
No additional user-provided review focus was specified.
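The array-element layout gap can be sketched with a small Java model. ElementGroup, acceptsStrict, and acceptsSpec are invented names; the point is the shape of the check, assuming the Variant Shredding rule that an element group may carry value, typed_value, or both.

```java
// Sketch (assumed spec rule): a shredded array element group is valid when
// it has at least one of {value, typed_value}; a check that demands `value`
// rejects the spec-valid typed-only layout the review calls out.
public class ShreddedArrayElementCheck {
    /** Presence flags for one element group in the Parquet schema. */
    record ElementGroup(boolean hasValue, boolean hasTypedValue) {}

    /** Stricter-than-spec check mirroring the review finding. */
    static boolean acceptsStrict(ElementGroup g) {
        return g.hasValue();
    }

    /** Spec-conformant check: at least one of the two children must exist. */
    static boolean acceptsSpec(ElementGroup g) {
        return g.hasValue() || g.hasTypedValue();
    }

    public static void main(String[] args) {
        ElementGroup typedOnly = new ElementGroup(false, true);
        if (acceptsStrict(typedOnly)) throw new AssertionError(); // rejected today
        if (!acceptsSpec(typedOnly)) throw new AssertionError();  // but spec-valid
    }
}
```

The NaN/Inf gap is orthogonal: non-finite shredded float/double typed values need their own decode coverage and are not representable by this layout check.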
TPC-DS: Total hot run time: 170353 ms
TPC-DS: Total hot run time: 169901 ms
Force-pushed from fa098c0 to bf0c548
/review
run buildall
I found additional blocking compatibility issues for valid Parquet VARIANT shredding layouts that omit value when data is fully represented by typed_value. Existing review threads already cover non-finite typed floats, typed array elements, case-sensitive keys, and value/residual pruning, so I did not duplicate those.
Critical checkpoints:
- Goal/test: the PR aims to read Iceberg/Parquet VARIANT, including shredded layouts, and adds local TVF regression coverage, but coverage does not include typed-value-only top-level or nested shredded field groups.
- Scope/focus: the change is focused, but the schema/pruning logic is stricter than the Parquet shredding layout it is trying to support.
- Concurrency/lifecycle/config/transactions/persistence: no new concurrency, lifecycle, config, transaction, or persistence concerns found in the reviewed paths.
- Parallel paths: Hive/local and Iceberg pruning have duplicated logic; both need the typed-value-only fix.
- Compatibility/data correctness: current code rejects or prunes away valid typed-value-only shredded data, causing scan failure or null/missing results.
- Tests: existing tests cover unshredded/shredded happy paths and several pruning observables, but miss the typed-value-only layouts described in the inline comments.
- Observability/performance: no additional observability or performance blocker found beyond the added profile string being used by tests.
- User focus: no additional user-provided review focus was present.
TPC-H: Total hot run time: 29314 ms
/review
I found one additional distinct correctness issue in the access-path collection logic.
Critical checkpoint conclusions:
- Goal/test: the PR aims to support Iceberg/Parquet VARIANT reading and nested pruning; coverage is broad, but this comparison-predicate path is still untested.
- Scope/focus: the changes are generally focused, but VARIANT root-demand propagation remains incomplete for some non-VARIANT consumers.
- Concurrency/lifecycle/config/compatibility: no additional concerns found in the reviewed paths.
- Parallel paths: the same access-path invariant must hold for scalar functions, casts, generators, aliases, and predicates; this finding covers a predicate path not covered by existing threads.
- Data correctness: blocking issue below can evaluate predicates against a pruned/truncated VARIANT value.
- Tests: please add regression coverage combining a whole-VARIANT comparison predicate with a sibling nested VARIANT predicate.
- Observability/performance: no additional observability issue found; pruning performance work should not compromise correctness.
User focus: no additional user-provided review focus was specified.
TPC-H: Total hot run time: 31087 ms
/review
run buildall
FE UT Coverage Report: Increment line coverage
TPC-DS: Total hot run time: 169611 ms
Automated review summary for PR 63192.
I did not find a distinct new inline issue beyond the already-open review threads, so I am not duplicating those comments. However, the existing threads still describe critical correctness gaps in VARIANT nested-column pruning and Parquet/Iceberg VARIANT reconstruction/read validation, including cases that can silently return truncated or wrong VARIANT data, crash/fail valid reads, or miss required validation for unsupported nested VARIANT paths. These are blocking until resolved or convincingly addressed in-thread.
Critical checkpoint conclusions:
- Goal/test proof: the PR aims to support Iceberg/Parquet VARIANT read/pruning and reject unsupported writes. It adds unit/regression coverage, but existing review threads identify important legal query/file shapes not covered or not handled correctly.
- Scope/focus: the change is broad across FE pruning, BE Parquet readers, Iceberg/Hive readers, and tests; several known issues indicate cross-path behavior is not yet consistently handled.
- Concurrency/lifecycle/config: I did not identify a new concurrency, lifecycle, or config-specific issue in this pass.
- Compatibility: read/write support intersects Iceberg format/version and mixed external paths; existing validation comments around ORC/nested VARIANT and write/delete-only behavior remain important compatibility/correctness checkpoints.
- Parallel code paths: existing comments cover multiple parallel paths that need consistent handling, including local/Hive/Iceberg Parquet, OLAP/file scans, generator/project/CTE-style access propagation, MAP/ARRAY offset/null paths, and ORC rejection.
- Error handling/data correctness: existing issues include silent wrong pruning, valid-data reconstruction failures, and invalid physical-column/stat access risks; these are data correctness blockers.
- Test coverage: new tests exist, but the known threads call out missing coverage for several edge cases, including Iceberg table path behavior, nested VARIANT in complex ORC columns, dynamic/whole-VARIANT access combinations, typed/null complex elements, and MAP/ARRAY offset/null reads.
- Performance/observability: profile assertions were added for some pruning behavior; existing offset-only MAP/ARRAY comments indicate intended performance optimizations are incomplete in some paths.
- User focus: no additional user-provided review focus was specified.
Please address the existing inline threads before this can be approved.
TPC-H: Total hot run time: 30841 ms
TPC-DS: Total hot run time: 170044 ms
BE UT Coverage Report: Increment line coverage / Increment coverage report
FE UT Coverage Report: Increment line coverage
run buildall
/review
I found two blocking correctness issues. The VARIANT access-path collector still drops subpath requirements across casts whose child is VARIANT, so sibling VARIANT predicates can cause scans to prune away data required by projections. The delete-only Iceberg MERGE change avoids write-side VARIANT validation, but the plan still projects all target columns and can therefore hit read-side VARIANT rejection or unnecessary VARIANT reads.
Critical checkpoints:
- Goal/test: the goal is partially met, but not for casted VARIANT subpaths or delete-only MERGE on tables with VARIANT columns; tests do not cover these residual cases.
- Scope: the changes are mostly focused, but the MERGE fix is incomplete because it only gates sink validation, not the read/projection path.
- Concurrency/lifecycle: no new risks identified.
- Configuration/compatibility: no new configuration or storage-format compatibility changes found.
- Parallel paths: FE pruning and Iceberg read/write validation were checked; the read-side MERGE path remains uncovered.
- Tests: coverage is insufficient for the two scenarios below.
- Observability: not the primary concern for these issues.
- Transactions/persistence: no changes involved.
- User focus: no additional user-provided review focus was specified.
return continueCollectAccessPath(child, variantRootContext);
}
if (!context.accessPathBuilder.isEmpty()
        && cast.getDataType() instanceof NestedColumnPrunable
This cast handling still drops a non-empty VARIANT access suffix whenever the cast child is VariantType, because VariantType is not NestedColumnPrunable and the method falls through to visiting the child with a fresh empty context. For example, SELECT cast(v AS variant)['a'] FROM variant_tbl WHERE v['k'] IS NOT NULL reaches visitCast() with suffix a; the suffix is discarded here, the sibling predicate can leave only [v, k] at the scan, and the projection then evaluates ['a'] against a pruned VARIANT. The same pattern applies to casts from a VARIANT subpath into complex types, e.g. selecting a field from cast(v['obj'] AS struct<...>). Please propagate the access builder through VARIANT-preserving / VARIANT-to-complex casts where the suffix still refers to the child value, and add coverage combining such a casted subpath projection with a sibling VARIANT predicate.
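The requested propagation rule can be sketched in Java. This is a hypothetical model of the suggested fix, not the Nereids visitor; Kind and contextForChild are invented, and the real check would consult the planner's type system.

```java
// Sketch (assumption): when the cast child is VARIANT (or another prunable
// nested type the suffix still refers to), forward the pending access suffix
// to the child instead of restarting with a fresh empty context.
import java.util.ArrayDeque;
import java.util.Deque;

public class CastAccessPathSketch {
    enum Kind { VARIANT, STRUCT, INT }

    static Deque<String> contextForChild(Kind childType, Deque<String> suffix) {
        if (childType == Kind.VARIANT || childType == Kind.STRUCT) {
            return suffix;            // keep ['a'] alive across the cast
        }
        return new ArrayDeque<>();    // suffix no longer refers to the child
    }

    public static void main(String[] args) {
        Deque<String> suffix = new ArrayDeque<>();
        suffix.push("a"); // models SELECT cast(v AS variant)['a']
        if (contextForChild(Kind.VARIANT, suffix).isEmpty()) {
            throw new AssertionError(); // suffix must survive the cast
        }
        if (!contextForChild(Kind.INT, suffix).isEmpty()) {
            throw new AssertionError(); // a scalar cast legitimately resets it
        }
    }
}
```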
outputExprs,
deleteCtx,
writesDataFiles(matchedClauses, notMatchedClauses),
Optional.empty(),
Passing writesDataFiles=false only disables the sink-side VARIANT write validation; it does not stop the delete-only plan from reading every visible target column. buildDeleteProjection() still appends all visible columns, so a delete-only MERGE whose join/predicate uses only non-VARIANT columns will still materialize a VARIANT column. On an ORC Iceberg table this reaches IcebergScanNode.validateVariantReadSupported() and fails with the Parquet-only VARIANT read error even though the delete only needs row-id/delete metadata, and on Parquet it performs unnecessary VARIANT reads. This is distinct from the earlier write-validation thread: the remaining failure is introduced by the read/projection side after the new flag is passed. Please make delete-only MERGE project only the operation, row id, lineage/delete columns, and columns required by the join/predicate, or otherwise avoid scanning unused VARIANT columns; add coverage for delete-only MERGE on a table with an unused VARIANT column, including a non-Parquet table if possible.
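The requested projection change can be sketched as a filter over the visible columns. This is a hypothetical model, not buildDeleteProjection(); Col and deleteOnlyProjection are invented names, and "meta" stands in for row-id/lineage/delete metadata columns.

```java
// Sketch (assumption): a delete-only MERGE needs only metadata columns plus
// columns referenced by the join/predicate; unused VARIANT data columns
// should never be projected, so they are never scanned.
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class DeleteProjectionSketch {
    record Col(String name, boolean isVariant, boolean isMeta) {}

    static List<Col> deleteOnlyProjection(List<Col> visible, Set<String> referenced) {
        return visible.stream()
                .filter(c -> c.isMeta() || referenced.contains(c.name()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Col> visible = List.of(
                new Col("_row_id", false, true),  // delete metadata
                new Col("id", false, false),      // used by the join predicate
                new Col("v", true, false));       // VARIANT, unused by the delete
        List<Col> projected = deleteOnlyProjection(visible, Set.of("id"));
        if (projected.stream().anyMatch(Col::isVariant)) throw new AssertionError();
        if (projected.size() != 2) throw new AssertionError();
    }
}
```

Under this shape, an ORC Iceberg table with an unused VARIANT column never reaches the Parquet-only VARIANT read validation on a delete-only plan.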
TPC-H: Total hot run time: 32090 ms
TPC-DS: Total hot run time: 168881 ms
/review
run buildall
BE UT Coverage Report: Increment line coverage / Increment coverage report
/review
run buildall
FE UT Coverage Report: Increment line coverage
I found one blocking correctness issue in the current PR.
Critical checkpoint conclusions:
- Goal/test proof: The PR aims to add Iceberg/Parquet VARIANT read support and nested pruning coverage. There is broad test coverage, but the non-Parquet Iceberg validation is not proven for pruned complex columns that contain an unselected VARIANT sibling.
- Scope/focus: The change is mostly focused on VARIANT read/pruning support, but the new read validation is applied at the original column level rather than the materialized/pruned slot level.
- Concurrency/lifecycle/config/compatibility: No new concurrency, lifecycle, or config issue found in the reviewed paths. The issue is a behavioral regression for valid ORC Iceberg reads of non-VARIANT subfields.
- Parallel paths: This affects both table-format validation and per-data-file validation because both call findVariantReadColumnName().
- Tests: Existing tests cover nested VARIANT rejection, but not the inverse case where nested pruning removes the VARIANT child before reading an ORC complex column.
- Observability/transactions/data writes: No additional concern found beyond the inline issue.
- Performance: No separate blocking performance issue found.
- User focus: No additional user-provided review focus was specified.
for (SlotDescriptor slot : desc.getSlots()) {
    Column column = slot.getColumn();
    if (containsVariantType(column.getType())) {
        return column.getName();
This validation uses slot.getColumn().getType(), which is the original Iceberg table column type, not the materialized slot type after nested-column pruning. That over-rejects legal non-Parquet reads when only non-VARIANT subfields are selected. For example, an ORC Iceberg column s STRUCT<a:int, v:variant> queried as SELECT s.a FROM tbl can be pruned/materialized as STRUCT<a:int> and does not read any VARIANT data, but this still sees the original s.v and throws the Parquet-only VARIANT error. The same false positive applies to validateVariantDataFileFormat() because it calls this helper too. Please validate the pruned/materialized slot type or selected access paths instead, and add coverage for reading a non-VARIANT subfield from a complex ORC Iceberg column that has an unselected VARIANT sibling.
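The catalog-type versus materialized-type distinction can be sketched in Java. This is a hypothetical model of the suggested fix; SlotType and both find methods are invented, and the real code would inspect the slot's effective type or its selected access paths.

```java
// Sketch (assumption): validation must consult the pruned/materialized slot
// type, not the original catalog column type, so reading only non-VARIANT
// subfields of a complex ORC column is not rejected.
public class VariantReadValidationSketch {
    record SlotType(String name, boolean containsVariant) {}

    /** Pattern the review flags: inspects the original catalog column type. */
    static String findVariantCatalog(SlotType catalogType, SlotType materializedType) {
        return catalogType.containsVariant() ? catalogType.name() : null;
    }

    /** Suggested pattern: inspect the pruned/materialized slot type. */
    static String findVariantMaterialized(SlotType catalogType, SlotType materializedType) {
        return materializedType.containsVariant() ? materializedType.name() : null;
    }

    public static void main(String[] args) {
        // s STRUCT<a:int, v:variant> queried as SELECT s.a FROM tbl:
        SlotType catalog = new SlotType("s", true);   // catalog still sees s.v
        SlotType pruned = new SlotType("s", false);   // pruned to STRUCT<a:int>
        if (findVariantCatalog(catalog, pruned) == null) {
            throw new AssertionError(); // false positive: rejects a legal read
        }
        if (findVariantMaterialized(catalog, pruned) != null) {
            throw new AssertionError(); // fixed check lets the legal read pass
        }
    }
}
```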
I found several correctness issues that should be addressed before merge. Critical checkpoint conclusions:
- Goal: the PR adds Iceberg/Parquet VARIANT read/write support and nested pruning; the implementation and tests cover many paths, but the current code still fails valid VARIANT array values and over-rejects non-Parquet Iceberg reads that do not actually materialize VARIANT data.
- Scope: the change is broad and not fully isolated, due to shared FE pruning and BE reader paths.
- Concurrency/lifecycle/config/persistence/transactions: no new concerns found in the reviewed paths.
- Compatibility: risk is limited to the new VARIANT support, but mixed file-format validation needs to distinguish selected data from the full catalog schema.
- Tests: coverage is substantial, but missing for heterogeneous/complex VARIANT array materialization and for ORC Iceberg complex columns where only non-VARIANT subfields are selected.
- User focus: no additional user-provided review focus was specified.
}

private String findVariantReadColumnName() {
    for (SlotDescriptor slot : desc.getSlots()) {
This now checks the full catalog column type, so it rejects non-Parquet Iceberg scans even when pruning/materialization only reads non-VARIANT subfields. For example, an ORC table with s STRUCT<a INT, v VARIANT> and a query SELECT s.a FROM t can materialize only s.a, but slot.getColumn().getType() still contains s.v and findVariantReadColumnName() throws the Parquet-only VARIANT error. The validation needs to inspect the selected/pruned access paths (or the effective materialized type) and reject only when a VARIANT subfield is actually read. This is distinct from the earlier recursive-validation threads: the recursive check was needed for selected nested VARIANTs, but applying it to the whole catalog column over-rejects legal non-VARIANT reads.
const size_t elements = array.size();
value.field = Field::create_field<TYPE_ARRAY>(std::move(array));
fill_field_type_info(&value);
if (value.base_scalar_type_id == INVALID_TYPE) {
When a residual VARIANT array cannot be represented with one scalar element type, this fallback replaces every decoded element with null. Valid VARIANT values such as [1, "x"], [{}, 1], or arrays containing objects reach fill_field_type_info() with INVALID_TYPE, then are stored as JSONB [null, ...], silently losing the actual element values. The fallback should serialize the decoded array contents (or otherwise preserve the VARIANT structure), not synthesize a null array except for genuinely null elements; please add coverage for heterogeneous and complex residual arrays.
FieldWithDataType value;
const size_t elements_count = array.size();
value.field = Field::create_field<TYPE_ARRAY>(std::move(array));
fill_variant_field_info(&value);
The typed/shredded array path has the same data-loss fallback: after each element is reconstructed into array, any array whose inferred base_scalar_type_id is INVALID_TYPE is converted to [null, ...]. A valid typed VARIANT array with struct/list/map elements or mixed element kinds will therefore return the correct length but all element values as null. Please preserve the reconstructed element values when materializing JSONB and add typed-array coverage for non-null complex/mixed elements.
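The data-loss fallback in both threads can be modeled in a few lines of Java. This is an illustrative sketch, not the BE C++ reconstruction code; materialize and its "typed:" marker are invented, and real Doris would emit JSONB rather than strings.

```java
// Sketch (assumption): when array elements share no single scalar type, a
// correct fallback serializes the decoded element values instead of
// replacing them with nulls of the right length.
import java.util.List;
import java.util.stream.Collectors;

public class VariantArrayFallbackSketch {
    static String materialize(List<Object> decoded) {
        boolean homogeneous = decoded.stream()
                .map(o -> o == null ? null : o.getClass())
                .distinct().count() <= 1;
        if (homogeneous) {
            return "typed:" + decoded;      // native typed-array path
        }
        // The flagged fallback would emit [null, null]; keep the values.
        return decoded.stream().map(String::valueOf)
                .collect(Collectors.joining(",", "[", "]"));
    }

    public static void main(String[] args) {
        // Mixed element kinds like [1, "x"] must survive materialization.
        if (!"[1,x]".equals(materialize(List.of(1, "x")))) {
            throw new AssertionError();
        }
    }
}
```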
TPC-H: Total hot run time: 31048 ms
TPC-DS: Total hot run time: 168382 ms
BE UT Coverage Report: Increment line coverage / Increment coverage report
BE UT Coverage Report: Increment line coverage / Increment coverage report
[feature](iceberg) Support reading Iceberg variant from Parquet
What problem does this PR solve?
Issue Number: N/A
Related PR: #63192
Problem Summary: Doris could not read Iceberg v3 VARIANT columns from Parquet files. This change maps Iceberg VARIANT to Doris VARIANT, validates the Parquet VariantShredding wrapper shape, decodes metadata/value residual data, reads shredded typed_value columns, and prunes shredded Parquet leaves for accessed variant paths. The VARIANT reader and planner changes stay scoped to the Iceberg/Parquet VARIANT path instead of coupling generic nested-column code to Iceberg-only behavior. Typed-only shredded projections stay on native Parquet typed columns when residual value columns are not selected, with counter coverage to catch row-wise performance regressions. Selected residual or complex layouts still fall back to row-wise reconstruction. This also preserves VARIANT subpaths through casts, validates the actual Iceberg data-file format for VARIANT reads, rejects duplicate VariantShredding structural children, preserves null temporal typed leaves without reading their physical value, and keeps delete-only Iceberg MERGE projections from reading unused visible target data columns.
Release note
Support reading Iceberg v3 VARIANT Parquet columns, including shredded typed_value column pruning and binary/UUID/primitive residual VARIANT values. Writing Iceberg VARIANT columns is rejected with an explicit unsupported error.
Check List (For Author)
Test: Regression test / Unit Test / Manual test
Unit Test: ./run-be-ut.sh --run --filter='ParquetVariantReaderTest.DirectTypedOnlyReaderCountersUseNativePath:ParquetVariantReaderTest.VariantReaderCountersUseRowWiseWhenResidualValueSelected:ParquetVariantReaderTest.RowWisePreservesExplicitVariantNullShreddedArrayElement:ParquetVariantReaderTest.RowWiseRejectsMissingShreddedArrayElement' (4 tests passed)
Unit Test: ./run-be-ut.sh --run -f 'ParquetVariantReaderTest.RowWisePreservesNullComplexTypedArrayElement:ParquetVariantReaderTest.RowWiseRejectsMissingShreddedArrayElement' (2 tests passed)
Unit Test: ./run-be-ut.sh --run --filter='ParquetVariantReaderTest.*' (85 tests passed on rerun; the first attempt failed before tests in OpenBLAS CMake getarch bootstrap)
Unit Test: ./run-be-ut.sh --run --filter='ParquetVariantReaderTest.*:NestedColumnAccessHelperTest.*' (127 tests passed)
Unit Test: ./run-be-ut.sh --run --filter='IcebergReaderCreateColumnIdsTest.*' (9 tests passed)
Unit Test: ./run-be-ut.sh --run --filter=ParquetVariantReaderTest.RejectVariantSchemaWithDuplicateStructuralChild:ParquetVariantReaderTest.DirectTypedOnlyPreservesTemporalLeafNull (2 tests passed; rerun after clang-format also passed)
Unit Test: ./run-be-ut.sh --run --filter=ParquetVariantReaderTest.DirectTypedOnlyReaderCountersUseNativePath (1 test passed after latest changes)
Unit Test: ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.PruneNestedColumnTest#testVariantComparisonPredicateCollectsWholeVariantOperand (1 test passed; Maven reactor succeeded)
Unit Test: ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.PruneNestedColumnTest#testVariantCastProjectionKeepsSubPathWithSiblingPredicate (1 test passed; Maven reactor succeeded)
Unit Test: ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.PruneNestedColumnTest (70 tests passed; Maven reactor succeeded)
Unit Test: ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.VariantPruningLogicTest#testExplodeSubqueryJoinAggAccessPaths (1 test passed; Maven reactor succeeded)
Unit Test: ./run-fe-ut.sh --run org.apache.doris.datasource.iceberg.source.IcebergScanNodeTest#testValidateVariantDataFileFormatRejectsOrcSplit (1 test passed; Maven reactor succeeded)
Unit Test: ./run-fe-ut.sh --run org.apache.doris.datasource.iceberg.source.IcebergScanNodeTest (6 tests passed; Maven reactor succeeded)
Unit Test: ./run-fe-ut.sh --run org.apache.doris.nereids.trees.plans.commands.IcebergMergeCommandTest#testDeleteProjectionDoesNotReadVisibleTargetColumns (1 test passed; Maven reactor succeeded)
Unit Test: ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.VariantPruningLogicTest (11 tests passed; Maven reactor succeeded)
Unit Test: ./run-fe-ut.sh --run org.apache.doris.datasource.iceberg.IcebergUtilsTest (passed)
Unit Test: ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.SlotTypeReplacerTest (5 tests passed)
Regression test: performance regression coverage is included in regression-test/suites/external_table_p0/tvf/test_local_tvf_iceberg_variant.groovy, including profile assertions that typed-only projections increment VariantDirectTypedValueReadRows and keep VariantRowWiseReadRows at 0. Not run locally in this worktree because no local Doris cluster/output BE+FE runtime is available.
Regression test: Added regression-test/suites/external_table_p0/iceberg/test_iceberg_variant_table_path.groovy to exercise the Iceberg REST catalog table path with nested VARIANT access and profile read-column assertions. Not run locally because Docker access to spark-iceberg is unavailable in this worktree.
Manual test: PATH=/mnt/disk6/common/ldb_toolchain_toucan/bin:$PATH build-support/clang-format.sh
Manual test: PATH=/mnt/disk6/common/ldb_toolchain_toucan/bin:$PATH build-support/check-format.sh
Manual test: git diff --check
Manual test: cd fe && mvn -pl fe-core checkstyle:check -DskipTests
Static analysis: CLANG_TIDY_BINARY=/tmp/clang-tidy-resource-filter build-support/run-clang-tidy.sh --build-dir be/ut_build_ASAN (passed for changed lines after adding the clang-tidy resource-dir and filtering a pre-existing be/src/core/types.h clang-tidy-nolint diagnostic; the unwrapped script was blocked by that existing header diagnostic)
Behavior changed: Yes.
- Doris can read Iceberg v3 VARIANT Parquet columns.
- Typed-only shredded projections are pruned onto native typed columns; selected residual or complex layouts are reconstructed row-wise.
- Malformed VariantShredding schemas and missing present shredded array payloads are rejected.
- Null complex/temporal typed values and explicit Variant null array elements are preserved.
- Whole-VARIANT scalar/comparison consumers force root access while literal subpath pruning is preserved for typed reads.
- Iceberg VARIANT reads from non-Parquet data files are recursively rejected during scan planning.
- Delete-only Iceberg MERGE avoids reading unused target data columns.
- Iceberg VARIANT data-file writes are rejected with an explicit error.
Does this need documentation: No